AITopics

Country:

North America > United States > California (0.14)
North America > United States > Virginia (0.04)
South America > Brazil (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)
Banking & Finance > Insurance (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Neural Information Processing SystemsDec-27-2025, 01:24:07 GMT

Data Minimization at Inference Time

In high-stakes domains such as legal, banking, hiring, and healthcare, learning models frequently rely on sensitive user information for inference, necessitating the complete set of features. This not only poses significant privacy risks for individuals but also demands substantial human effort from organizations to verify information accuracy. This study asks whether it is necessary to use all input features for accurate predictions at inference time. The paper demonstrates that, in a personalized setting, individuals may only need to disclose a small subset of features without compromising decision-making accuracy. The paper also provides an efficient sequential algorithm to determine the appropriate attributes for each individual to provide. Evaluations across various learning tasks show that individuals can potentially report as little as 10\% of their information while maintaining the same accuracy level as a model that employs the full set of user information.

data minimization, inference time, name change, (4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Neural Information Processing SystemsOct-9-2025, 10:09:26 GMT

e48880ea81caa7836e6a0694049093ae-Paper-Conference.pdf

artificial intelligence, core feature, machine learning, (17 more...)

Country:

North America > United States > California (0.14)
North America > United States > Virginia (0.04)
South America > Brazil (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)
Banking & Finance > Insurance (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Zhou, Jijie, Mireshghallah, Niloofar, Li, Tianshi

Operationalizing Data Minimization for Privacy-Preserving LLM Prompting

arXiv.org Artificial IntelligenceOct-7-2025

The rapid deployment of large language models (LLMs) in consumer applications has led to frequent exchanges of personal information. To obtain useful responses, users often share more than necessary, increasing privacy risks via memorization, context-based personalization, or security breaches. We present a framework to formally define and operationalize data minimization: for a given user prompt and response model, quantifying the least privacy-revealing disclosure that maintains utility, and we propose a priority-queue tree search to locate this optimal point within a privacy-ordered transformation space. We evaluated the framework on four datasets spanning open-ended conversations (ShareGPT, WildChat) and knowledge-intensive tasks with single-ground-truth answers (CaseHold, MedQA), quantifying achievable data minimization with nine LLMs as the response model. Our results demonstrate that larger frontier LLMs can tolerate stronger data minimization while maintaining task quality than smaller open-source models (85.7% redaction for GPT-5 vs. 19.3% for Qwen2.5-0.5B). By comparing with our search-derived benchmarks, we find that LLMs struggle to predict optimal data minimization directly, showing a bias toward abstraction that leads to oversharing. This suggests not just a privacy gap, but a capability gap: models may lack awareness of what information they actually need to solve a task.

large language model, machine learning, natural language, (21 more...)

2510.03662

Country: North America > United States (0.67)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Leysen, Jens, Favier, Marco, Goethals, Bart

What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems

arXiv.org Artificial IntelligenceSep-1-2025

Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data `necessity' difficult to implement.

artificial intelligence, history, minimization, (16 more...)

doi: 10.1145/3746252.3761058

2508.21547

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California (0.68)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Statutes (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)

Neural Information Processing SystemsAug-22-2025, 00:54:31 GMT

ac6b3cce8c74b2e23688c3e45532e2a7-Paper.pdf

data minimization, data mining, machine learning, (21 more...)

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Slovenia (0.04)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Staab, Robin, Jovanović, Nikola, Mai, Kimberly, Ganesh, Prakhar, Vechev, Martin, Fioretto, Ferdinando, Jagielski, Matthew

SoK: Data Minimization in Machine Learning

arXiv.org Artificial IntelligenceAug-15-2025

Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and \emph{DM-adjacent} methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.

artificial intelligence, machine learning, natural language, (19 more...)

2508.10836

Country:

Europe (0.92)
North America > United States (0.67)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)

Hanson, Mathias, Lewkowicz, Gregory, Verboven, Sam

Engineering the Law-Machine Learning Translation Problem: Developing Legally Aligned Models

arXiv.org Artificial IntelligenceApr-25-2025

Organizations developing machine learning-based (ML) technologies face the complex challenge of achieving high predictive performance while respecting the law. This intersection between ML and the law creates new complexities. As ML model behavior is inferred from training data, legal obligations cannot be operationalized in source code directly. Rather, legal obligations require "indirect" operationalization. However, choosing context-appropriate operationalizations presents two compounding challenges: (1) laws often permit multiple valid operationalizations for a given legal obligation-each with varying degrees of legal adequacy; and, (2) each operationalization creates unpredictable trade-offs among the different legal obligations and with predictive performance. Evaluating these trade-offs requires metrics (or heuristics), which are in turn difficult to validate against legal obligations. Current methodologies fail to fully address these interwoven challenges as they either focus on legal compliance for traditional software or on ML model development without adequately considering legal complexities. In response, we introduce a five-stage interdisciplinary framework that integrates legal and ML-technical analysis during ML model development. This framework facilitates designing ML models in a legally aligned way and identifying high-performing models that are legally justifiable. Legal reasoning guides choices for operationalizations and evaluation metrics, while ML experts ensure technical feasibility, performance optimization and an accurate interpretation of metric values. This framework bridges the gap between more conceptual analysis of law and ML models' need for deterministic specifications. We illustrate its application using a case study in the context of anti-money laundering.

artificial intelligence, legal obligation, machine learning, (16 more...)

2504.16969

Country:

North America > United States (0.46)
Europe > Austria (0.28)

Genre: Research Report (0.82)

Industry:

Law > Statutes (1.00)
Law > Civil Rights & Constitutional Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > Europe Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

arXiv.org Artificial IntelligenceMar-7-2025

Algorithmic Data Minimization for Machine Learning over Internet-of-Things Data Streams

Shaowang, Ted, Liu, Shinan, Marques, Jonatas, Feamster, Nick, Krishnan, Sanjay

Machine learning can analyze vast amounts of data generated by IoT devices to identify patterns, make predictions, and enable real-time decision-making. By processing sensor data, machine learning models can optimize processes, improve efficiency, and enhance personalized user experiences in smart systems. However, IoT systems are often deployed in sensitive environments such as households and offices, where they may inadvertently expose identifiable information, including location, habits, and personal identifiers. This raises significant privacy concerns, necessitating the application of data minimization -- a foundational principle in emerging data regulations, which mandates that service providers only collect data that is directly relevant and necessary for a specified purpose. Despite its importance, data minimization lacks a precise technical definition in the context of sensor data, where collections of weak signals make it challenging to apply a binary "relevant and necessary" rule. This paper provides a technical interpretation of data minimization in the context of sensor streams, explores practical methods for implementation, and addresses the challenges involved. Through our approach, we demonstrate that our framework can reduce user identifiability by up to 16.7% while maintaining accuracy loss below 1%, offering a viable path toward privacy-preserving IoT data processing.

data minimization, dataset, identifiability, (12 more...)

2503.05675

Country:

North America > United States > California (0.14)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
Europe > Austria > Vienna (0.14)
(7 more...)

Genre: Research Report (0.50)

Industry:

Information Technology > Smart Houses & Appliances (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Internet of Things (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.30)

arXiv.org Artificial IntelligenceJan-7-2025

Rescriber: Smaller-LLM-Powered User-Led Data Minimization for Navigating Privacy Trade-offs in LLM-Based Conversational Agent

Zhou, Jijie, Xu, Eryue, Wu, Yaoyao, Li, Tianshi

The proliferation of LLM-based conversational agents has resulted in excessive disclosure of identifiable or sensitive information. However, existing technologies fail to offer perceptible control or account for users' personal preferences about privacy-utility tradeoffs due to the lack of user involvement. To bridge this gap, we designed, built, and evaluated Rescriber, a browser extension that supports user-led data minimization in LLM-based conversational agents by helping users detect and sanitize personal information in their prompts. Our studies (N=12) showed that Rescriber helped users reduce unnecessary disclosure and addressed their privacy concerns. Users' subjective perceptions of the system powered by Llama3-8B were on par with that by GPT-4o. The comprehensiveness and consistency of the detection and sanitization emerge as essential factors that affect users' trust and perceived protection. Our findings confirm the viability of smaller-LLM-powered, user-facing, on-device privacy controls, presenting a promising approach to address the privacy and trust challenges of AI.

information, large language model, machine learning, (18 more...)

2410.11876

Country:

Europe (1.00)
Asia > Middle East > UAE (0.46)
North America > United States > New York (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Personal > Interview (0.92)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)